from time import time
start = time()
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px
from sklearn.neighbors import KernelDensity
import rpy2.robjects as robjects
# Bright seaborn plots
sns.set_theme(palette="bright")
# load R extension to jupyter-lab cells
# (useful to combine python and R coding, but R is limited here; use R-studio instead
# if you're going to use only R)
%load_ext rpy2.ipython
A confidence interval is a range of values representing how confident we are about some parameter (e.g. mean, mode, median, etc.) of the population based only on a sample.
Usually when working with data we only have a sample of the population rather than the whole population. However, we are able to estimate a range of possible values of a parameter (generally the mean, $\mu$) of the population if we have a significant data sample. The central limit theorem states that the distribution of sample means approximates a normal distribution and, as you may already know, we are able to retrieve approx. 95% of our data covering two (to be more accurate, 1.96) standard deviations ($\sigma$) to the right and left of the mean.
Getting the means of random samples of the population and retrieving the 95% of that distribution (a normal distribution, because of the central limit theorem) gives us the confidence interval of the true population mean with a confidence of 95% based on the sample; the larger the sample size, the better the confidence interval obtained. We can also obtain other confidence intervals, for example, a confidence interval of 99% taking approx. 3$\sigma$ to the right and left of the distribution of the means.
If we have multiple samples, the confidence interval can be computed by getting the means of the samples and then retrieving the lower limit of the real values of the parameter as the 0.025 quantile and the upper limit as the 0.975 quantile; the range of values between those quantiles encompasses 95% of all the data. But if we have only one sample, we can calculate the standard error (SE), which is the standard deviation of the means of multiple samples of the population, and multiply it by 1.96 ($\sigma$)
$$ lower\ limit = \mu - SE * 1.96 $$

$$ upper\ limit = \mu + SE * 1.96 $$

The general representation used to calculate the CI is also shown as:
$$ CI = \mu\ \pm \ z * \frac{s}{\sqrt{n}}\ \ \ \ \ \ \ \ z=confidence\ level\ value;\ s=standard\ deviation\ of\ the\ sample;\ n=sample\ size $$

Calculating the CI as shown above assumes your data is normally distributed, which is not always the case. In that situation, we can also take advantage of a powerful method called bootstrapping.
Bootstrapping basically resamples your sample, calculating virtually any parameter of interest to us which tries to describe the population parameter. This resampling is computed by taking a sample of your sample of the same size but allowing repetition; for each resample the parameter is calculated and, from there, several methods have been devised to calculate the CI: empirical bootstrapping (here also known as basic or reverse), normal, percentile, etc.
Thus, for example, normal bootstrapping uses the definition of the CI for normal data (see above), because the parameters obtained from the resamples follow a normal distribution, substituting $\mu$ with $\mu^*$, the mean of your resamples (or of the parameter of your interest), and the $SE$ with the $SE^*$, again, of your resamples.
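As a minimal sketch of the normal bootstrap just described (the data and the number of resamples here are arbitrary choices, not from the sections below):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(10, 250)      # the single sample we have

# Resample the sample with replacement and store the mean of each resample
boot_means = np.array([rng.choice(sample, sample.size, replace=True).mean()
                       for _ in range(10000)])

# Normal bootstrap: plug the bootstrap mean (mu*) and bootstrap SE (SE*)
# into the CI definition for normal data
mu_star = boot_means.mean()
se_star = boot_means.std(ddof=1)
ci = (mu_star - 1.96 * se_star, mu_star + 1.96 * se_star)
print(ci)
```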
On the other hand, the empirical method defines the CI for the mean as:
$$ CI = (2\overline{x} - x^*_{1 - \frac{\alpha}{2}},\ 2\overline{x} - x^*_{\frac{\alpha}{2}}) $$

$$ where\ \overline{x}\ is\ the\ mean\ of\ the\ original\ data\ (the\ sample) $$

$$ x^*\ is\ the\ mean\ of\ the\ resamples\ found\ at\ the\ percentiles $$

$$ 1\ -\ \frac{\alpha}{2}\ and\ \frac{\alpha}{2} $$

# Let's create random normal data to illustrate the confidence interval
norm_pop = np.random.normal(10, 1, 100000)
print(f"True mean population: {norm_pop.mean()}")
True mean population: 9.99892386569881
# Now let's get the confidence intervals with a confidence of 95% based on only a random sample
def ShowCI95(population:"array-like", sample_size:int):
    '''
    Show on screen the 95% confidence interval of the mean

    This function prints on screen the range of values of the 95% confidence interval based on a
    sample of the population provided

    Parameters
    ----------
    population: array-like
        array-like representing the population from which a sample of size sample_size will be taken
    sample_size: int
        size of the sample

    Examples
    --------
    >>> import numpy as np
    >>> import pandas as pd
    >>> data = np.random.normal(0, 1, 10000)
    >>> ShowCI95(data, 1000)
    Sample size: 1000
    Sample mean: -0.006450809541726166
    True mean population is between -0.06841125689868024 and 0.055509637815227914 with a confidence of 95%
    The sample mean has an approx error of -0.006 +/- 0.062
    '''
    norm_sam = np.random.choice(population, sample_size)
    # sem() is a pandas function to get the SE
    error = pd.Series(norm_sam).sem() * 1.96
    mu = norm_sam.mean()
    l_inf = mu - error
    l_sup = mu + error
    print(f"Sample size: {norm_sam.shape[0]}")
    print(f"Sample mean: {norm_sam.mean()}")
    print(f"True mean population is between {l_inf} and {l_sup} with a confidence of 95%")
    print(f"The sample mean has an approx error of {round(mu, 3)} +/- {round(error, 3)}")
ShowCI95(norm_pop, 5000)
Sample size: 5000
Sample mean: 9.999882072842729
True mean population is between 9.971736335678909 and 10.028027810006549 with a confidence of 95%
The sample mean has an approx error of 10.0 +/- 0.028
# Let's now try to find the confidence interval using random sampling of the population
# Every sample will have 5000 data points and 100 samples will be drawn
print(f"True population mean {norm_pop.mean()}")
means = []
for experiment in range(100):
    mu = np.random.choice(norm_pop, 5000).mean()
    means.append(mu)
means = pd.Series(means)
print(f"The mean of the sample means is {means.mean()}")
# If we take the percentiles covering the 95 percent of the means obtained, we are getting the confidence
# intervals with a confidence of 95%
P_025 = means.quantile(0.025)
P_975 = means.quantile(0.975)
print(f"The 95% confidence interval range from the value {P_025} to {P_975}")
True population mean 9.99892386569881
The mean of the sample means is 9.998894739238308
The 95% confidence interval range from the value 9.975312085880674 to 10.02294945027279
# The error increases as the sample data size decreases
for size in [10000, 5000, 2500, 1000, 500, 250, 125]:
    ShowCI95(norm_pop, size)
    print("")
Sample size: 10000
Sample mean: 10.005574696966415
True mean population is between 9.98607221306244 and 10.025077180870388 with a confidence of 95%
The sample mean has an approx error of 10.006 +/- 0.02

Sample size: 5000
Sample mean: 10.006389840895665
True mean population is between 9.97897845183047 and 10.03380122996086 with a confidence of 95%
The sample mean has an approx error of 10.006 +/- 0.027

Sample size: 2500
Sample mean: 9.927657132062514
True mean population is between 9.888139515400704 and 9.967174748724323 with a confidence of 95%
The sample mean has an approx error of 9.928 +/- 0.04

Sample size: 1000
Sample mean: 9.938950538840226
True mean population is between 9.874657825927429 and 10.003243251753023 with a confidence of 95%
The sample mean has an approx error of 9.939 +/- 0.064

Sample size: 500
Sample mean: 9.99482637010372
True mean population is between 9.909101829156397 and 10.080550911051043 with a confidence of 95%
The sample mean has an approx error of 9.995 +/- 0.086

Sample size: 250
Sample mean: 9.910705896216525
True mean population is between 9.784947006278523 and 10.036464786154527 with a confidence of 95%
The sample mean has an approx error of 9.911 +/- 0.126

Sample size: 125
Sample mean: 10.124699080548902
True mean population is between 9.962122625960186 and 10.287275535137619 with a confidence of 95%
The sample mean has an approx error of 10.125 +/- 0.163
# The mean of the sample means is pretty much the same as the population mean but, in reality, we wouldn't know that
# Now lets see the population distribution and the distributions of the sample means
# Population plot
fig, (ax_1, ax_2) = plt.subplots(1,2, figsize=(17, 8))
sns.kdeplot(norm_pop, linewidth=2, ax=ax_1)
ax_1.axvline(norm_pop.mean(), c="black", linestyle="--", label="$\mu$")
ax_1.set_title("\nPopulation distribution\n", fontdict={"size":17})
ax_1.legend(fontsize=14)
# Means plot
sns.kdeplot(means, linewidth=2, ax=ax_2)
# Mean and 95% of the data
ax_2.axvline(means.mean(), c="black", linestyle="--", label="$\mu$")
ax_2.axvline(P_025, c="orange", linestyle="dotted", linewidth=3, label="$CI$")
ax_2.axvline(P_975, c="orange", linestyle="dotted", linewidth=3)
X_fill, Y_fill = ax_2.get_lines()[0].get_data()
mask = (X_fill >= P_025) & (X_fill <= P_975)
ax_2.fill_between(X_fill[mask], Y_fill[mask], color="orange", alpha=0.7, label="$95\%\ CI$")
ax_2.set_title("\nDistribution of means (sample means)\n", fontdict={"size":17})
ax_2.legend(fontsize=14)
plt.subplots_adjust(wspace=0.1)
plt.show()
# Now let's use bootstrapping with the basic and percentile methods to illustrate its power on non-normal data
# We are going to create a Bootstrapping class just because; take your time to decode the python class below
class Bootstrapping:
    '''
    Bootstrapping resampling of the sample

    This class does two types of bootstrapping based on the mean of the sample and returns
    the lower and upper confidence intervals of the mean. The methods available are basic
    bootstrapping, also known as Reverse Percentile Interval or Empirical bootstrapping,
    and percentile bootstrapping

    Parameters
    ----------
    data: array-like
        sample to be resampled
    tests: int. Default 10,000
        integer indicating how many times the sample will be resampled

    Returns
    -------
    tuple: tuple with the lower limit and upper limit of the confidence interval

    Examples
    --------
    >>> from numpy.random import normal
    >>> sample = normal(0, 1, 1000)
    >>> boot = Bootstrapping(sample, tests=1000)
    >>> boot.basic()
    (-0.044194517957389545, 0.07898892888826403)
    '''
    def __init__(self, data:"array-like", tests:int=10000):
        self.data = np.array(data).reshape(-1,)
        self.tests = tests
        self.mean = self.data.mean()
        self.std = self.data.std()

    def __percentileLimits(self, means, lower, upper):
        return (np.percentile(means, lower), np.percentile(means, upper))

    def __bootstrap(self):
        # Resampling the sample and computing the mean of each resample
        # This is basically bootstrapping
        means = []
        for _ in range(self.tests):
            mu = np.random.choice(self.data, self.data.shape[0], replace=True).mean()
            means.append(mu)
        return np.array(means).reshape(-1,)

    def basic(self, confidence=95):
        '''
        Get the confidence interval based on the definition:
        CI = (low = 2X - x_{1 - a/2}, high = 2X - x_{a/2})
        where "X" is the sample/original mean, "x" is the mean found at the {1 - a/2} or
        {a/2} percentile of the resampled means and "a" is alpha, the value of the tail
        of the distribution
        '''
        means = self.__bootstrap()
        alpha = (100 - confidence) / 2
        lower_limit, upper_limit = self.__percentileLimits(means, alpha, 100 - alpha)
        return (2 * self.mean - upper_limit, 2 * self.mean - lower_limit)

    def percentile(self, confidence=95):
        '''
        Get the confidence interval based on the definition:
        CI = (low = x_{a/2}, high = x_{1 - a/2})
        where "x" is the mean found at the {a/2} or {1 - a/2} percentile of the
        resampled means and "a" is alpha, the value of the tail of the
        distribution
        '''
        means = self.__bootstrap()
        alpha = (100 - confidence) / 2
        return self.__percentileLimits(means, alpha, 100 - alpha)
# Here is our non-normal data population and sample (no resamples of the sample yet)
exp_pop = np.random.exponential(10, 5000)
exp_sam = np.random.choice(exp_pop, 250)
# Plotting using plotly to better visualize the means (interactive but no preview on github)
fig = ff.create_distplot([exp_pop, exp_sam], ["Population","Sample"], show_hist=False,
                         curve_type='kde', colors=["steelblue","orange"])
fig.add_vline(exp_pop.mean(), annotation={"text":"$\mu$"},
              annotation_position="top right", line_color="steelblue",
              line_dash="dash")
fig.add_vline(exp_sam.mean(), annotation={"text":"$\mu$"},
              annotation_position="top left", line_color="orange",
              line_dash="dot")
fig.update_layout(
margin=dict(l=20, r=20, t=15, b=15),
width=860,
height=500,
template="plotly_dark"
)
fig.show()
# Now let's get the CI using our class and resampling 10000 times (default value of the class)
boots = Bootstrapping(exp_sam)
# Basic and percentile with confidence of 95%
Confidence = 95
low_b, high_b = boots.basic(confidence=Confidence)
low_p, high_p = boots.percentile(confidence=Confidence)
print(f"\nTrue population mean:{exp_pop.mean()}")
print("\nCI USING BASIC AND PERCENTILE BOOTSTRAPPING")
print("\nBasic bootstrapping:")
print(f"\tThe {Confidence}% confidence interval range from the value {low_b} to {high_b}")
print("\nPercentile bootstrapping:")
print(f"\tThe {Confidence}% confidence interval range from the value {low_p} to {high_p}")
True population mean:9.88262809918132

CI USING BASIC AND PERCENTILE BOOTSTRAPPING

Basic bootstrapping:
    The 95% confidence interval range from the value 8.414255435709055 to 10.752011086961195

Percentile bootstrapping:
    The 95% confidence interval range from the value 8.496958528561187 to 10.810982384481969
# Let's plot the results
data_dict = {}
data_dict['category'] = ['Basic','Percentile']
data_dict['lower'] = [low_b, low_p]
data_dict['upper'] = [high_b, high_p]
dataset = pd.DataFrame(data_dict)
plt.figure(figsize=(15,7))
y_axis = np.arange(0, dataset.shape[0] / 2, 0.5)
for lower, upper, y in zip(dataset['lower'], dataset['upper'], y_axis):
    plt.plot((lower, upper), (y, y), 'o-', color="orange", linewidth=3)
plt.axvline(exp_pop.mean(), linestyle="--", c="steelblue", label=f"Population $\mu={round(exp_pop.mean(),2)}$")
plt.axvline(exp_sam.mean(), linestyle="dotted", c="green", label=f"Sample $\mu={round(exp_sam.mean(),2)}$")
plt.ylim(-0.1, 1)
plt.xlim( exp_sam.mean() - (exp_sam.mean() / 5), exp_sam.mean() + (exp_sam.mean() / 5) )
plt.yticks(y_axis, list(dataset['category']), fontsize=15)
plt.legend(fontsize=12)
plt.title("\nConfidence interval (95%) of Percentile\nand Basic bootstrapping (orange line)\n", fontdict={"size":17})
plt.show()
# We don't need to code our own functions/classes to calculate the CI using bootstrap. scipy already has an
# implementation for basic, percentile and another one called BCa (which is a little bit more complicated to code)
boots_types = {"percentile":np.NaN, "basic":np.NaN, "bca":np.NaN}
for key in boots_types:
    b = stats.bootstrap((exp_sam,), np.mean, confidence_level=0.95, method=key)
    boots_types[key] = b.confidence_interval
boots_types
{'percentile': ConfidenceInterval(low=8.503408565386506, high=10.809201549049465),
'basic': ConfidenceInterval(low=8.455806682698904, high=10.74773487091856),
'bca': ConfidenceInterval(low=8.550763081545766, high=10.84050814242377)}
# Let's plot the results of scipy
b = boots_types
data_dict = {}
data_dict['category'] = ['percentile', 'basic', 'bca']
data_dict['lower'] = [b["percentile"].low, b["basic"].low, b["bca"].low]
data_dict['upper'] = [b["percentile"].high, b["basic"].high, b["bca"].high]
dataset = pd.DataFrame(data_dict)
plt.figure(figsize=(15,7))
y_axis = np.arange(0, dataset.shape[0] / 2, 0.5)
for lower, upper, y in zip(dataset['lower'], dataset['upper'], y_axis):
    plt.plot((lower, upper), (y, y), 'o-', color="orange", linewidth=3)
plt.axvline(exp_pop.mean(), linestyle="--", c="steelblue", label=f"Population $\mu={round(exp_pop.mean(),2)}$")
plt.axvline(exp_sam.mean(), linestyle="dotted", c="green", label=f"Sample $\mu={round(exp_sam.mean(),2)}$")
plt.ylim(-0.1, 1.1)
plt.xlim( exp_sam.mean() - (exp_sam.mean() / 5), exp_sam.mean() + (exp_sam.mean() / 5) )
plt.yticks(y_axis, list(dataset['category']), fontsize=15)
plt.legend(loc="center right", fontsize=12, bbox_to_anchor=(0.985, 0.7))
plt.title("\nConfidence intervals (95%) of percentile,\nbasic and BCa bootstrapping (orange lines)\n", fontdict={"size":17})
plt.show()
A p-value is the probability of obtaining observations equal to or rarer than your data point or observation. This translates to the area under the curve of the tails of your data distribution (see the plot above).
This p-value ($p$) is used to reject or not a null hypothesis. Any hypothesis that is not the null is simply denoted as an alternative. To reject or not the null, an alpha ($\alpha$) value is chosen as a cutoff to denote at what probability an observation can be considered extraordinary. This $\alpha$ value is usually chosen as 0.05, although in reality it tends to be an arbitrary value. Thus, depending on the "error" one is willing to accept, $\alpha$ can vary, for example, being stricter with values of 0.001 (accepting one "error" per 1000 observations) or looser with 0.2 (one "error" per 5 cases). Thus, with a value of $p < \alpha$ the null hypothesis can be rejected and we would go for the alternative hypothesis; otherwise, with $p > \alpha$, we would not have enough evidence to reject the null hypothesis.
On the other hand, how the heck are those hypotheses determined? Well, several "definitions" of what a null hypothesis is can be found, such as "what is accepted by default", "whatever is known or already defined", "what we do not want to test", or, more formally, a "hypothesis that proposes that no statistical significance exists in a set of given observations", while the alternative is "what we are interested in knowing". But the reality is that they are somewhat ambiguous, and personal criteria can certainly bias the determination of the hypotheses and lead to choosing the wrong one (as anyone may do, including me). In fact, the choice of which hypothesis should be configured as the alternative or the null is a much more complex process and will not be discussed in detail here; a more descriptive article about this can be found elsewhere. The reality is that many statistical tests have already defined which is their null hypothesis and, therefore, which is the alternative.
# Let's get normal data and point out those "extraordinary" observations
normal = np.random.normal(10, 5, 5000)
# tails based on alpha = 0.05, or just 0.025 for each tail (right and left)
alpha_tail = 0.05 / 2
left_tail = np.quantile(normal, alpha_tail)
right_tail= np.quantile(normal, 1 - alpha_tail)
# Plotting distribution and tails
plt.figure(figsize=(12,5))
axis = sns.kdeplot(normal)
axis.axvline(left_tail, linestyle="dotted", color="green", label=r"$\frac{\alpha}{2}$")
axis.axvline(right_tail,linestyle="dotted", color="green")
# values equal to or lower than alpha
X,Y = axis.get_lines()[0].get_data()
mask_left = (X <= left_tail)
mask_right= (X >= right_tail)
axis.fill_between(x=X[mask_left], y1=Y[mask_left], color="green", alpha=0.7, label="Extraordinary observations")
axis.fill_between(x=X[mask_right], y1=Y[mask_right], color="green", alpha=0.7)
axis.legend(fontsize=9)
plt.title(f"\nNormal distribution with $\\alpha$ = {alpha_tail * 2}\n")
plt.show()
# USING THE EXAMPLE ABOVE (PLOT)....
# The p_value is the probability of getting the same or even rarer observations than your data point. For example,
# the probability of getting a value of x=5 or rarer means getting a value of 5, or 4, or 3, etc. Likewise,
# it also encompasses getting values of 15, or 16, or 17, etc., because values of x greater than 15 have (roughly)
# the same probability as getting a value of 5 and we also take into account stranger observations
adjusted_curve = stats.norm(loc=normal.mean(), scale=normal.std())
# Probability of getting equal or rarer values of x=5 (cumulative density function of 5 of only one tail)
x = 5
p_value_left = adjusted_curve.cdf(x)
# The symmetric nature of the normal distribution makes it simple to get the probability of the right tail
p_value = p_value_left * 2
print(f"\nThe p-value of getting the same or rarer observation of our value chosen '{x}' is {p_value}\n")
The p-value of getting the same or rarer observation of our value chosen '5' is 0.3151290712605498
# As you saw above, a probability of 0.3 is not such a rare thing, it's pretty common, therefore our value of x=5
# is not as special as you could think. There are even stranger observations
# Getting the exact symmetric value of x on the right tail
right_tail_value_percentile = 100 - stats.percentileofscore(normal, x)
right_tail_value = np.percentile(normal, right_tail_value_percentile)
# Estimating the density of the plot because seaborn sometimes is not as good as sklearn
# This is useful to fill under the curve in the plot below
normal_copy = normal.reshape(-1,1)
samples_x_left = np.arange(min(normal_copy), x, 0.1).reshape(-1, 1)
samples_x_right = np.arange(right_tail_value, max(normal_copy), 0.1).reshape(-1, 1)
kernel_density_object = KernelDensity().fit(normal_copy)
samples_y_left = np.exp(kernel_density_object.score_samples(samples_x_left))
samples_y_right = np.exp(kernel_density_object.score_samples(samples_x_right))
# Plotting distribution and tails
plt.figure(figsize=(12,5))
axis = sns.kdeplot(normal)  # plot the 1-D array; the reshaped copy is only needed by sklearn
axis.axvline(x, linestyle="dotted", color="green", linewidth=3)
axis.axvline(right_tail_value, linestyle="dotted", color="green", linewidth=3)
# values equal or lower than x (in this case it was 5)
axis.fill_between(x=samples_x_left.reshape(-1,), y1=samples_y_left.reshape(-1,),
color="green", alpha=0.7,
label=f"$P\_value={round(p_value, 3)}$")
axis.fill_between(x=samples_x_right.reshape(-1,), y1=samples_y_right.reshape(-1,),
color="green", alpha=0.7)
axis.legend(fontsize=9)
plt.title(f"\nValues with probabilities equal or lower with $x={x};\ p\_value$\n")
plt.show()
Parametric tests are those statistical tests that make assumptions about the parameters of the population distribution (usually that it is normally distributed) from which the sample is drawn. In this category we find some of the following tests.
Every test shown above is used in different scenarios according to the data used, sample size and purpose of the research. Also, it's worth noting that every test produces a value which needs to be compared to the distribution of the respective test to get the right p-value and, therefore, reject or not the null hypothesis.
This parametric test is used to compare the means between two groups (it determines whether the means of the two groups are equal).
The t-test or Student's t-test is used when the standard deviation of the population is unknown and/or the sample size is lower than 30 (rule of thumb). The t-test makes the following assumptions about your data
The second point discussed above is not exactly true, because there is a t-test for samples with unequal variances. In this way, there are 3 t-tests trying to answer different questions.
Paired t-test (within-subject design): a paired t-test is used if the groups come from the same population (i.e., measurements before and after a group undergoes an experimental treatment)
Two-sample or independent t-test (between-subjects design): a two-sample t-test is used if the two groups come from different populations (i.e., two species or two groups of people from different cities). The independent t-test has two elemental variants: an independent t-test for samples with equal size and variance, and an independent t-test for samples with equal or unequal sample size and unequal variances (the latter is also called Welch's test)
One-sample t-test: if we only have one group that is compared to a single value, we use the one-sample t-test.
On the other hand, we also need to remember those tests can be applied to one-tailed and two-tailed experiments.
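The three variants map directly onto scipy functions; here is a quick sketch with made-up normal data (the numbers are only illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
before = rng.normal(10, 2, 25)            # one group, first measurement
after = before + rng.normal(0.5, 1, 25)   # same subjects, second measurement
other = rng.normal(11, 2, 25)             # an independent group

# Paired t-test (within-subject design): same subjects measured twice
paired = stats.ttest_rel(before, after)

# Independent t-test (between-subjects design); equal_var=False gives Welch's test
welch = stats.ttest_ind(before, other, equal_var=False)

# One-sample t-test: one group against a fixed value
one_sample = stats.ttest_1samp(before, popmean=10)

print(paired.pvalue, welch.pvalue, one_sample.pvalue)
```

Each function also accepts the `alternative` parameter to choose a two-tailed, right-tailed or left-tailed test.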
T-test paired
$$T = \frac{mean_1 - mean_2}{\frac{s_{diff}}{\sqrt{n}}}$$

$$s_{diff}=standard\ deviation\ of\ the\ paired\ differences\ of\ the\ samples$$

$$degrees\ of\ freedom\ or\ df= n - 1$$

One sample T-test
$$T = \frac{\overline{x} - \mu_0}{\frac{s}{\sqrt{n}}}$$

$$\overline{x}=mean\ of\ the\ sample$$

$$\mu_0 = value\ to\ compare$$

$$s = standard\ deviation$$

$$df=n-1$$

Independent T-test or two-samples T-test
Equal variances
$$T = \frac{\overline{x_1} - \overline{x_2}}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

$$S_p=\sqrt{ \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

$$s_1; s_2 = standard\ deviations$$

$$df= n_1 + n_2 - 2$$

Unequal size and unequal variances (also called Welch's t-test)
$$T = \frac{\overline{x_1} - \overline{x_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

$$df = \frac{(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2})^2}{\frac{1}{n_1 - 1}(\frac{s_1^2}{n_1})^2 + \frac{1}{n_2 - 1}(\frac{s_2^2}{n_2})^2}$$

$$s_1;s_2 = standard\ deviations$$

You probably noticed something called degrees of freedom. In a nutshell, the distribution curve changes as the sample size decreases: the tails become longer and wider and the peak lower. Therefore, the curve of our statistic has to be refined according to the sample size, and this is what the degrees of freedom are for.
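To make the Welch formulas concrete, here is a sketch computing the statistic and the degrees of freedom by hand and checking them against scipy's `equal_var=False` result (random data with arbitrary parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10, 1, 40)
b = rng.normal(11, 3, 25)

s1, s2 = a.var(ddof=1), b.var(ddof=1)
n1, n2 = a.size, b.size

# Welch's statistic and its degrees of freedom (formulas above)
T = (a.mean() - b.mean()) / np.sqrt(s1 / n1 + s2 / n2)
df = (s1 / n1 + s2 / n2) ** 2 / (
    (s1 / n1) ** 2 / (n1 - 1) + (s2 / n2) ** 2 / (n2 - 1)
)
p = 2 * stats.t.sf(abs(T), df)   # two-tailed p-value

# scipy's Welch test should agree with the manual computation
res = stats.ttest_ind(a, b, equal_var=False)
print(T, res.statistic)
print(p, res.pvalue)
```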
I mentioned earlier that in order to perform a t-test there must be homogeneity of variance. An F-test is a popular test to assess it.
The F-test is usually (but not exclusively) used to compare the variances of two samples, and it's sensitive to non-normal data. The null hypothesis states equal variances for the two samples, $var_1 = var_2$, and the two-tailed alternative hypothesis is $var_1 \ne var_2$.

$$ F = \frac{\sigma_1^2}{\sigma_2^2}\ \ \ \ \ \ \ \sigma_1^2;\sigma_2^2 = variances\ of\ the\ populations\ or\ the\ samples $$

Depending on whether the test is set to be two-tailed, right-tailed or left-tailed, the order used to determine $\sigma_n^2$ varies. For instance, in a two-tailed or right-tailed test the larger variance between the samples is set as the numerator, whereas in a left-tailed test, the smaller variance becomes the numerator.
Other tests usually used to check for homogeneity of variances (sometimes even referred to as "more powerful than the F-test") are Levene's test and Bartlett's test. Those are not going to be discussed in more detail here, but it's good to know that they exist.
# Loading dataset
iris_df = sns.load_dataset("iris")
iris_df.sample(10)
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 36 | 5.5 | 3.5 | 1.3 | 0.2 | setosa |
| 40 | 5.0 | 3.5 | 1.3 | 0.3 | setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 89 | 5.5 | 2.5 | 4.0 | 1.3 | versicolor |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 95 | 5.7 | 3.0 | 4.2 | 1.2 | versicolor |
| 12 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
| 49 | 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 76 | 6.8 | 2.8 | 4.8 | 1.4 | versicolor |
# Filtering samples of setosa and virginica for sepal_length
setosa_df = iris_df.loc[iris_df["species"] == "setosa", :].sepal_length
virginica_df = iris_df.loc[iris_df["species"] == "virginica", :].sepal_length
# Sampling 30 observations to test if the means between the two species are different (using an independent t-test)
sample_size = 30
setosa_sample = setosa_df.sample(sample_size, random_state=42)
virginica_sample = virginica_df.sample(sample_size, random_state=42)
plt.figure(figsize=(10,5))
sns.kdeplot(setosa_sample, label="setosa", linewidth=3)
sns.kdeplot(virginica_sample, label="virginica", linewidth=3)
plt.legend()
plt.title(f"Distribution approximation using a $sample\_size={sample_size}$")
plt.show()
# Both samples seem normal and obviously the populations differ but, anyway, let's do a t-test.
# First let's check if the variances are equal or not using a two-tailed F-test
# H0 = There is no difference between variances: var_1 = var_2 if p > 0.05
# H1 = There is a difference between variances: var_1 != var_2 if p < 0.05
def Ftest(sample1:"array-like", sample2:"array-like") -> (float, float):
    '''
    F-test of the variance

    This function does the two-tailed F-test to compare the variances of two populations, with the null
    hypothesis (H0) describing equal variances and the alternative hypothesis (H1) unequal variances

    Parameters
    ----------
    sample1: array-like
        Sample of the first population
    sample2: array-like
        Sample of the second population

    Returns
    -------
    (float, float): F-statistic and p_value respectively

    Examples
    --------
    >>> import scipy.stats as stats
    >>> import numpy as np
    >>> sample_A = [1,2,3,4,5]
    >>> sample_B = [1,2,4,4,5]
    >>> Ftest(sample_A, sample_B)
    (1.08, 0.9423361401911694)
    '''
    # Variance of the samples (ddof=1 for the sample variance)
    sigma_1 = np.var(sample1, ddof=1)
    sigma_2 = np.var(sample2, ddof=1)
    # The larger variance goes in the numerator; the degrees of freedom must follow the same order
    if sigma_1 > sigma_2:
        F = sigma_1 / sigma_2
        dfn, dfd = np.size(sample1) - 1, np.size(sample2) - 1
    else:
        F = sigma_2 / sigma_1
        dfn, dfd = np.size(sample2) - 1, np.size(sample1) - 1
    p_value_one_tailed = 1 - stats.f.cdf(F, dfn, dfd)
    return F, p_value_one_tailed * 2
def ShowTypeVariances(p, alpha, statistic):
    print(f'{statistic[0]}: statistic = {statistic[1]}; p_value = {p}')
    if p < alpha:
        print("\tThe variances are not equal")
    else:
        print("\tThe variances are equal")
alpha = 0.05
F_statistic, p_value_F = Ftest(setosa_sample, virginica_sample)
ShowTypeVariances(p_value_F, alpha, ["F-test", F_statistic])
F-test: statistic = 3.391227706345914; p_value = 0.001539251598021929
    The variances are not equal
# We'll also do Levene's and Bartlett's tests using the same alpha
levene_statistic, p_value_l = stats.levene(setosa_sample, virginica_sample)
# Bartlett's test is more sensitive to non-normal data. Only used if we are really confident that the
# data is normal
bartlett_statistic, p_value_b = stats.bartlett(setosa_sample, virginica_sample)
ShowTypeVariances(p_value_l, alpha, ["Levene-test", levene_statistic])
ShowTypeVariances(p_value_b, alpha, ["Bartlett-test", bartlett_statistic])
Levene-test: statistic = 4.564968517458503; p_value = 0.0368626178547197
    The variances are not equal
Bartlett-test: statistic = 10.027323487542784; p_value = 0.0015423499256511143
    The variances are not equal
# The variances are not equal, as you can see above, therefore
# we will use a t-test for unequal variances (Welch's test)
independent_ttest = stats.ttest_ind(setosa_sample, virginica_sample, equal_var=False)
if independent_ttest.pvalue < alpha:
    print("The means between setosa and virginica are statistically different (mu_1 != mu_2)")
else:
    print("The means between setosa and virginica are statistically equal (mu_1 = mu_2)")
The means between setosa and virginica are statistically different (mu_1 != mu_2)
An independent t-test with unequal variances was done for the two species. However, with stats.ttest_rel we are able to run a paired-sample t-test, and with stats.ttest_1samp a one-sample t-test, and by means of the parameter alternative we can choose between two-tailed, right-tailed or left-tailed tests.
The Z-test is another parametric test to compare means between groups; more formally, "a z-test is a statistical test to determine whether two population means are different when the sample size is large and the variances are known". As with a t-test used to compare means, a Z-test has the same function but assumes that we do know the standard deviation of the population from which the sample came and that it follows a normal distribution. Similarly, a Z-test can be performed one-sample and for two independent samples, either two-tailed or one-tailed.
Two-sample
$$z = \frac{\overline{x_1}\ -\ \overline{x_2}}{ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

$$\sigma_1;\sigma_2=population\ standard\ deviations$$

$$\overline{x_1};\overline{x_2}=sample\ means$$

One-sample
$$Z=\frac{\overline{X}-\mu_{0}}{\frac{\sigma}{\sqrt{n}}}$$

$$\overline{X}=mean\ of\ the\ sample$$

$$\mu_{0}=mean\ to\ compare$$

$$\sigma=population\ standard\ deviation$$

In addition, a Z-test can be applied to two proportions
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})( \frac{1}{n_1} + \frac{1}{n_2})}}$$

$$\hat{p}_1;\hat{p}_2 =proportions\ of\ the\ samples$$

$$\hat{p}=pooled\ proportion\ of\ both\ samples$$

Both tests are used to compare means for normal data, but which one should you use? Well, one may find that for sample sizes smaller than 30 the t-test seems to be used almost exclusively, while for larger samples a Z-test is often recommended regardless of whether or not the population standard deviation is known. This makes sense in light of the fact that the Student's t-distribution becomes more similar to the Z-distribution as the sample size increases, and the standard deviation of the population can be approximated with that of the sample (that's the reason why in some notations of the Z-test you will find the formula using the difference of the means divided by the standard error of the sample instead of the standard deviation of the population, similar to what the one-sample t-test does). However, strictly speaking, some mention that if your sample size is larger than 30 and you do NOT know the variance or standard deviation of the population you would be making a mistake (not a big one actually) when using a Z-test, since you will get a slight error in the statistic obtained.
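The two-proportion version can be sketched directly from the formula above; the counts below are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: successes out of trials in two groups
x1, n1 = 45, 100
x2, n2 = 30, 100

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)       # pooled proportion (p-hat)

Z = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_value = 2 * stats.norm.sf(abs(Z))  # two-tailed p-value
print(Z, p_value)
```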
As mentioned earlier, the shape of the t-distribution changes with the sample size, which is why degrees of freedom are defined for the t-test; this is not necessary for a Z-distribution. The lower the degrees of freedom, the wider the tails of the curve; as the degrees of freedom tend to infinity, the t-distribution converges to the Z-distribution (which has thinner tails).
Apparently, the use of a Z-test when the sample size is greater than 30 has a historical origin: not every statistician had access to tables of p-value probabilities for all degrees of freedom, as opposed to a single table for the Z-distribution (where, again, degrees of freedom are not needed), and since both distributions become similar as the sample size increases, the approximation was acceptable.
With that small reflection in mind: nowadays we have software with all those values precomputed, so, in my personal understanding, regardless of the sample size, if we do not know the standard deviation of the population we should NOT use a Z-test but a t-test. On the other hand, if the calculations are done by hand, it may be convenient to use a Z-test just for ease, with a minimal error that can often be negligible.
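To see the convergence numerically, here is a small sketch of mine comparing the two-tailed p-value of the same observed statistic under a t-distribution and under the Z-distribution as the sample size grows:

```python
from scipy import stats

statistic = 2.0  # an arbitrary observed test statistic
p_z = 2 * stats.norm.sf(statistic)  # Z p-value, independent of sample size
gaps = []
for n in (5, 30, 300):
    p_t = 2 * stats.t.sf(statistic, df=n - 1)  # t p-value with n-1 degrees of freedom
    gaps.append(p_t - p_z)
    print(f"n={n:>3}  p_t={p_t:.4f}  p_z={p_z:.4f}  difference={p_t - p_z:.4f}")
```

The t-based p-value is always a bit larger (wider tails), and the gap shrinks as n grows, which is the practical content of the "n > 30" rule of thumb.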
## Let's calculate a Z-test for the following problem found on the internet (from
# youtube channel statslectures)
# In the population, the average IQ is 100 with a standard deviation of 15. A team
# of scientists wants to test a new medication to see if it has either a positive or
# negative effect on intelligence, or no effect at all. A sample of 30 participants who
# have taken the medication has a mean of 140. Did the medication affect intelligence?
# Notes
# 1. This is a two-tailed problem because the exercise states for either a negative or positive effect,
# not only for a positive effect (right tail) or only a negative effect (left tail)
# 2. This is a one-sample Z-test because we are comparing a sample versus a given value. We are not comparing
# two samples or two populations
# 3. Let's set a confidence level of 95% or alpha of 0.05
#
# H0: There is no difference between the IQ mean of the participants (sample mean = 140) and the population IQ mean (mu_0 = 100)
# sample mean = mu_0
# H1: There is a difference between the IQ mean of the participants (sample mean = 140) and the population IQ mean (mu_0 = 100)
# sample mean != mu_0
std_population = 15
mu_population = 100  # known population mean to compare against
sample_mean = 140    # mean of the 30 participants
n = 30
alpha = 0.05
# One-sample Z-statistic: (sample mean - population mean) / standard error
standard_error = std_population / np.sqrt(n)
Z_statistic = (sample_mean - mu_population) / standard_error
Z_distribution = stats.norm(loc=0, scale=1)
p_value_one_tail = Z_distribution.sf(abs(Z_statistic))
p_value_two_tail = p_value_one_tail * 2
if p_value_two_tail < alpha:
    print("The drug had an effect on the participants c: !")
else:
    print("The drug had no effect on the participants :c ")
print(f"p_value: {p_value_two_tail:.2e}")
The drug had an effect on the participants c: ! p_value: 2.57e-48
# On the other hand, here is an interactive graphic representation of the
# Z-distribution (standard normal distribution) and t-distribution using
# different degrees of freedom. It's easier to plot using a data frame
def t_distribution_standard(df, x):
    return stats.t.pdf(x=x, df=df, loc=0, scale=1)
# Range of values to retrieve density of Z-distribution and T-distribution
x_values = np.arange(-5, 5.025, 0.025)
# Normal data for plotting
normal_distribution_plot = stats.norm.pdf(x=x_values, loc=0, scale=1)
t_distributions = np.empty((x_values.shape[0], 3))
df_t_distributions = pd.DataFrame()
# Retrieving the density of t-distributions based on the degrees
for degree in range(1, 31):
    t_distributions[:, 0] = degree
    t_distributions[:, 1] = x_values.copy()
    t_distributions[:, 2] = t_distribution_standard(degree, x_values)
    df_t_distributions = pd.concat([df_t_distributions, pd.DataFrame(t_distributions)])
df_t_distributions.rename({0:"Degree", 1:"Std", 2:"Density"}, inplace=True, axis=1)
df_t_distributions["Degree"] = df_t_distributions["Degree"].transform(int)
fig = px.line(df_t_distributions, x="Std", y="Density",
animation_frame="Degree", range_y=[0,0.45],
template="plotly_dark")
fig.update_traces(line=dict(width=3), name="T-distribution", showlegend=True)
fig.update_layout(title_text='T-distributions vs Z-distribution', title_x=0.45)
fig.add_scatter(x=x_values, y=normal_distribution_plot,
line=dict(width=3,
color="orange",
dash="dash"),
name="Normal (Z-distribution)",
hovertemplate='Std=%{x}<br>Density=%{y}'
)
fig.update_layout(
margin=dict(l=30, r=150, t=50, b=10),
)
#fig["layout"].pop("updatemenus")
fig.show()
A detailed description of this statistical test can be found in ANOVA análisis de varianza para comparar múltiples medias by Joaquín Amat Rodrigo, from which the following explanations are taken.
Now then, in brief: ANOVA is a family of statistical tests used to compare the means between two or more groups. Similar to a t-test, ANOVA comes in three main types of analysis
One-way ANOVA, to compare the means of more than two groups with one factor and one response variable (for example, the response to a drug (quantitative) to improve concentration at different ages [children, adults, or the elderly])
ANOVA for multiple factors and one response (for example, the response to a drug (quantitative) to improve concentration in different sexes [men or women] of different ages [children, adults, or the elderly])
ANOVA for dependent, paired, or repeated-measures data (for example, the response to a drug (quantitative) to improve concentration before and after it has been taken, at different ages [children, adults, or the elderly])
In general, these tests share some assumptions
Independence: the observations must be random and the groups or factors independent of each other (this assumption does not hold for repeated-measures ANOVA)
Homoscedasticity: the variance within the groups must be approximately equal across all of them (this assumption does not hold for heteroscedastic ANOVA, which uses Welch's correction, i.e. the Welch test)
Normal distribution within each group
On the other hand, each of the ANOVA tests mentioned above may have fewer assumptions or additional ones.
The statistic obtained from an ANOVA test is the $F_{ratio}$, which follows a Fisher–Snedecor F distribution and is built under the null hypothesis that the means of all the studied groups are the same (no difference), also denoted $\mu_1 = \mu_2 = \mu_3 = ... = \mu_n$; in contrast, the alternative hypothesis states that the means of at least two groups differ significantly. Note that an ANOVA test does not tell us which of those groups have different means. For that, post-hoc methods (methods applied after rejecting the null hypothesis) such as the Tukey-Kramer Honest Significant Difference (HSD) or pairwise t-test comparisons with corrections to avoid type I error inflation, such as the Holm–Bonferroni adjustment, are used.
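As a sketch of that workflow in Python (the data are made up; this assumes SciPy ≥ 1.8, which provides `stats.tukey_hsd`):

```python
from scipy import stats

# Three hypothetical groups (the same data vectors as the R example further below)
group_a = [13, 11, 8, 11, 8]
group_b = [13, 11, 14, 14, 10]
group_c = [4, 1, 3, 4, 2]

# Omnibus one-way ANOVA: is at least one mean different?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")

# Post-hoc Tukey-Kramer HSD: which pairs of groups actually differ?
res = stats.tukey_hsd(group_a, group_b, group_c)
print(res)  # pairwise mean differences, confidence intervals and p-values
```

Here the omnibus test rejects the null hypothesis, and the post-hoc step shows the difference is driven by group C, while A vs B is not significant.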
Meanwhile, a statistic that accompanies an ANOVA test is the effect size, denoted $\eta^2$, which measures how much of the variance in the quantitative dependent variable results from the influence of the qualitative independent variable, or, in other words, how much the independent variable (factor) affects the dependent variable. It is calculated as:
$$\eta^2 = \frac{SS_{effect}}{SS_{total}}$$
$$SS_{effect}= The\ sum\ of\ squares\ of\ an\ effect\ for\ one\ variable$$
$$SS_{total}= The\ total\ sum\ of\ squares\ in\ the\ ANOVA\ model$$
The most commonly used classification levels for the effect size are:
$$0.01 = small$$
$$0.06 = medium$$
$$0.14 = large$$
%%R
data_1 <- c(13,11,8,11,8)
data_2 <- c(13,11,14,14,10)
data_3 <- c(4,1,3,4,2)
data_df_1 <- data.frame(data=data_1)
data_df_1$Groups <- "A"
data_df_2 <- data.frame(data=data_2)
data_df_2$Groups <- "B"
data_df_3 <- data.frame(data=data_3)
data_df_3$Groups <- "C"
data <- rbind(data_df_1, data_df_2, data_df_3)
anova <- aov(data ~ Groups, data)
summary(anova)
Df Sum Sq Mean Sq F value   Pr(>F) Groups 2 252.93 126.47 39.11 5.53e-06 *** Residuals 12 38.80 3.23
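The $\eta^2$ effect size described above can be computed directly from the sums of squares; here is a minimal numpy-only sketch of mine using the same three data vectors as the R cell:

```python
import numpy as np

# Same three data vectors as the R cell above
groups = [np.array([13, 11, 8, 11, 8]),
          np.array([13, 11, 14, 14, 10]),
          np.array([4, 1, 3, 4, 2])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# SS_effect: variability of the group means around the grand mean
ss_effect = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# SS_total: variability of every observation around the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

eta_squared = ss_effect / ss_total
print(f"eta^2 = {eta_squared:.3f}")
```

The result is well above the 0.14 cutoff, i.e. a large effect: most of the variance in the response is explained by the group factor.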
%%R
plot(anova)